Two sets of polar solvent (PS) and non-polar solvent (NPS) descriptor data were imported: one containing the descriptors for all solvents (those used in either the training or the validation phase diagrams) and one containing the descriptors for a reduced set of solvents (those used in the phase diagrams of the training subset). The required preprocessing steps were determined using the training-only solvent set and then applied to the full training/validation solvent set.
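library(caret)  # provides nearZeroVar(), preProcess() and findCorrelation()
library(e1071)  # assumed source of skewness(); the moments package offers an equivalent
PS_descriptors <- read.csv("Data/PS descriptors.csv")
NPS_descriptors <- read.csv("Data/NPS descriptors.csv")
PS_descriptors_training <- read.csv("Data/PS descriptors (training).csv")
NPS_descriptors_training <- read.csv("Data/NPS descriptors (training).csv")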
# Remove PS descriptors (columns) that contain missing values
PS_descriptors <- PS_descriptors[, colSums(is.na(PS_descriptors)) == 0]
PS_descriptors_training <- PS_descriptors_training[, colSums(is.na(PS_descriptors_training)) == 0]
## Number of PS descriptors removed: 21
# Remove NPS descriptors (columns) that contain missing values
NPS_descriptors <- NPS_descriptors[, colSums(is.na(NPS_descriptors)) == 0]
NPS_descriptors_training <- NPS_descriptors_training[, colSums(is.na(NPS_descriptors_training)) == 0]
## Number of NPS descriptors removed: 21
The number of descriptors was reduced by removing all descriptors that have fewer than 2 unique values (i.e., zero variance descriptors) or exactly 2 unique values of which one is present only once (i.e., near-zero variance descriptors).
PS_descriptors_nzv <- nearZeroVar(PS_descriptors_training, freqCut = 7, uniqueCut = 23)
PS_descriptors_training <- PS_descriptors_training[-c(PS_descriptors_nzv)]
PS_descriptors <- PS_descriptors[-c(PS_descriptors_nzv)]
## Number of PS descriptors removed: 2040
NPS_descriptors_nzv <- nearZeroVar(NPS_descriptors_training, freqCut = 11, uniqueCut = 16)
NPS_descriptors_training <- NPS_descriptors_training[-c(NPS_descriptors_nzv)]
NPS_descriptors <- NPS_descriptors[-c(NPS_descriptors_nzv)]
## Number of NPS descriptors removed: 1997
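The cut-offs passed to `nearZeroVar` follow from the sample sizes (9 polar and 13 non-polar training solvents). The base R sketch below illustrates this on toy data. The helper `nzv_metrics` is hypothetical and only mirrors the two quantities that `nearZeroVar` thresholds: the ratio of the most common to the second most common value, and the percentage of unique values.

```r
# Hypothetical helper: frequency ratio and percentage of unique values
# for a single descriptor. A constant descriptor is given a ratio of Inf
# here, which is a simplification made for this sketch.
nzv_metrics <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  pct_unique <- 100 * length(counts) / length(x)
  c(freq_ratio = freq_ratio, pct_unique = pct_unique)
}

# Toy descriptors with two distinct values, one of them present only once
ps_example  <- c(rep(0, 8), 1)    # 9 polar solvents
nps_example <- c(rep(0, 12), 1)   # 13 non-polar solvents

ps_metrics  <- nzv_metrics(ps_example)   # ratio 8, about 22.2 % unique
nps_metrics <- nzv_metrics(nps_example)  # ratio 12, about 15.4 % unique

# A descriptor is flagged when the ratio exceeds freqCut and the
# percentage of unique values does not exceed uniqueCut
ps_flagged  <- ps_metrics[["freq_ratio"]] > 7 && ps_metrics[["pct_unique"]] <= 23
nps_flagged <- nps_metrics[["freq_ratio"]] > 11 && nps_metrics[["pct_unique"]] <= 16

# A 7:2 split among 9 solvents is not flagged (ratio 3.5)
split_metrics <- nzv_metrics(c(rep(0, 7), 1, 1))
split_flagged <- split_metrics[["freq_ratio"]] > 7 && split_metrics[["pct_unique"]] <= 23
```

With these settings only the 8:1 (PS) and 12:1 (NPS) splits are caught, which matches the description of near-zero variance descriptors given above.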
First, the skewness of the descriptors was estimated as:
skewness_PS <- apply(PS_descriptors_training[, 2:641], 2, skewness)
skewness_NPS <- apply(NPS_descriptors_training[, 2:684], 2, skewness)
The results were plotted as histograms to visually assess the overall level of skewness in the descriptor datasets for polar and non-polar solvents.
The NPS descriptors are notably more skewed than the PS ones.
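As a rough illustration of what the skewness estimate measures, the sketch below computes a moment-based sample skewness column by column on toy data. The function `sample_skewness` and the toy data frame are assumptions made for this example; packaged `skewness` functions may apply a slightly different small-sample correction.

```r
# Moment-based sample skewness, g1 = m3 / m2^(3/2)
# (illustrative helper, not the function used in the analysis)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centred <- x - mean(x)
  m2 <- mean(centred^2)
  m3 <- mean(centred^3)
  m3 / m2^(3 / 2)
}

# Toy descriptors: one symmetric, one with a long right tail
toy <- data.frame(
  symmetric    = c(-2, -1, 0, 1, 2),
  right_skewed = c(1, 1, 1, 2, 10)
)

# Column-wise skewness, analogous to the apply() calls above
toy_skewness <- apply(toy, 2, sample_skewness)
```

The symmetric column has a skewness of zero and the right-tailed column a positive skewness; a histogram of such values over all descriptors gives the overview described above.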
To reduce the skewness of the descriptors, the Yeo-Johnson transformation was applied to all predictors. It performs better than the alternative Box-Cox transformation at normalising variables that contain zeros and negative numbers, such as the ones here.
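For reference, the Yeo-Johnson transformation for a fixed lambda can be sketched in base R as follows. The function `yeo_johnson` is a hypothetical helper written for this illustration; in the analysis itself lambda is estimated per descriptor by `preProcess`.

```r
# Yeo-Johnson transformation for a given lambda (illustrative helper).
# Non-negative and negative values use separate power-type branches,
# which is why zeros and negative numbers pose no problem.
yeo_johnson <- function(y, lambda) {
  out <- numeric(length(y))
  pos <- y >= 0
  if (abs(lambda) > 1e-8) {
    out[pos] <- ((y[pos] + 1)^lambda - 1) / lambda
  } else {
    out[pos] <- log(y[pos] + 1)
  }
  if (abs(lambda - 2) > 1e-8) {
    out[!pos] <- -(((1 - y[!pos])^(2 - lambda) - 1) / (2 - lambda))
  } else {
    out[!pos] <- -log(1 - y[!pos])
  }
  out
}

x <- c(-3, -1, 0, 1, 3)
identity_case <- yeo_johnson(x, lambda = 1)  # lambda = 1 leaves the values unchanged
log_case      <- yeo_johnson(x, lambda = 0)  # log(y + 1) for the non-negative values
```

Values of lambda below 1 compress the right tail and values above 1 compress the left tail.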
Additionally, all predictors were scaled and centered. This is important for predictive models that separate or rank samples by their distance in predictor space, such as support vector machines, where predictors on larger numeric scales would otherwise dominate.
PS_preprocessing <- preProcess(PS_descriptors_training[, 2:641],
                               method = c("scale", "center", "YeoJohnson"))
NPS_preprocessing <- preProcess(NPS_descriptors_training[, 2:684],
                                method = c("scale", "center", "YeoJohnson"))
## Polar solvent descriptors transformation
## Created from 9 samples and 640 variables
##
## Pre-processing:
## - centered (640)
## - ignored (0)
## - scaled (640)
## - Yeo-Johnson transformation (420)
##
## Lambda estimates for Yeo-Johnson transformation:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.9830 -0.5054 0.4226 0.3363 1.1346 2.9800
## Non-polar solvent descriptors transformation
## Created from 13 samples and 683 variables
##
## Pre-processing:
## - centered (683)
## - ignored (0)
## - scaled (683)
## - Yeo-Johnson transformation (416)
##
## Lambda estimates for Yeo-Johnson transformation:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.9623 -1.2822 -0.5927 -0.6617 -0.2935 2.9267
After the transformation, the overall skewness of the descriptors was significantly reduced.
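The principle of estimating preprocessing parameters on the training solvents and reusing them for all solvents can be sketched in base R as below. The toy data frames and variable names are assumptions for illustration only. With caret, a fitted `preProcess` object is applied to new data with `predict()`, which presumably is how the transformations above were carried over.

```r
# Toy training descriptors and a full set containing one extra solvent
training     <- data.frame(d1 = c(1, 2, 3, 4), d2 = c(10, 20, 30, 40))
all_solvents <- rbind(training, data.frame(d1 = 5, d2 = 0))

# Centering and scaling parameters are estimated on the training set only
train_mean <- colMeans(training)
train_sd   <- apply(training, 2, sd)

# The same parameters are then applied to both sets
training_scaled <- scale(training, center = train_mean, scale = train_sd)
all_scaled      <- scale(all_solvents, center = train_mean, scale = train_sd)
```

The training rows are transformed identically in both sets, while the additional solvent is expressed relative to the training mean and standard deviation, so no information from validation-only solvents enters the preprocessing.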
A number of predictive models, especially linear models (e.g., logistic regression and linear discriminant analysis), require that the predictors used are not co-linear to avoid errors in estimating the contribution (weight) of each predictor to the model.
Hence, the cross-correlation between the predictors was calculated to assess the presence of co-linearity in the predictor data sets.
As can be seen from the two correlation plots above, there is significant co-linearity in both descriptor datasets.
To reduce this co-linearity, a Pearson two-tailed correlation test was used. Pearson correlation coefficients above 0.735 (p = 0.01, 9 polar solvents) and above 0.641 (p = 0.01, 13 non-polar solvents) were considered significant.
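The critical correlation coefficients can be reproduced from the t distribution using the relation r = t / sqrt(t^2 + df). The helper `critical_r` below is a hypothetical function written for this check. The thresholds quoted above correspond to 9 and 13 degrees of freedom; the more common convention of df = n - 2 gives stricter values, which is worth keeping in mind if the cut-offs are reused.

```r
# Critical Pearson r for a two-tailed test at significance level alpha
# (illustrative helper)
critical_r <- function(alpha, df) {
  t_crit <- qt(1 - alpha / 2, df)
  t_crit / sqrt(t_crit^2 + df)
}

# Degrees of freedom equal to the number of solvents (values used above)
r_df9  <- critical_r(0.01, df = 9)    # about 0.735
r_df13 <- critical_r(0.01, df = 13)   # about 0.641

# Degrees of freedom equal to n - 2 (conventional choice)
r_n9  <- critical_r(0.01, df = 9 - 2)    # about 0.798
r_n13 <- critical_r(0.01, df = 13 - 2)   # about 0.684
```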
Polar solvent descriptors
First, the cross-correlations between the Dragon and PaDEL PS descriptors were reduced:
PS_corr_DP <- cor(PS_descriptors_training[, 8:641], method = "pearson")
PS_corr_names_DP <- findCorrelation(PS_corr_DP, cutoff = 0.735, names = TRUE, exact = TRUE)
# Keep the Dragon and PaDEL descriptors that were not flagged by findCorrelation()
PS_colnames_DP <- setdiff(colnames(PS_descriptors_training[8:641]), PS_corr_names_DP)
## Number of descriptors removed: 602
Next, the cross-correlations between the reduced set of Dragon and PaDEL descriptors and the remaining PubChem and in-house calculated descriptors were removed manually. For this purpose, the cross-correlations between the descriptors were calculated (Pearson two-tailed correlation). Correlation values larger than 0.735 were treated as significant (assigned a "1"), while lower correlation values were treated as insignificant (assigned a "0"). The self-correlations of the molecular descriptors along the matrix diagonal were also assigned a value of 0, yielding a binary correlation matrix. All descriptors that were significantly correlated with the in-house and PubChem descriptors were removed.
# Binary correlation matrix
PS_colnames_basic_DP <- c(colnames(PS_descriptors_training[2:7]), PS_colnames_DP)
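PS_corr_basic_DP <- as.data.frame(cor(PS_descriptors_training[c(PS_colnames_basic_DP)],
                                      method = "pearson"))
# abs() is applied to the correlations (not to the threshold) so that
# strong negative correlations are flagged as well
PS_corr_basic_DP[abs(PS_corr_basic_DP) <= 0.735] <- 0
PS_corr_basic_DP[abs(PS_corr_basic_DP) > 0.735] <- 1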
diag(PS_corr_basic_DP) <- 0
# Removal of significantly correlated descriptors
PS_corr_basic_DP <- PS_corr_basic_DP[PS_corr_basic_DP["AmphM_PS"] == 0,
                                     PS_corr_basic_DP["AmphM_PS"] == 0]
PS_corr_basic_DP <- PS_corr_basic_DP[PS_corr_basic_DP["Hydrogen.Bond.Acceptor.Count_PS"] == 0,
                                     PS_corr_basic_DP["Hydrogen.Bond.Acceptor.Count_PS"] == 0]
## Number of descriptors removed: 4
## Significant correlations left: 0
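The binary correlation matrix procedure can be illustrated on a small toy data set. The data and names below are invented for this sketch, and the absolute value of the correlation is thresholded so that strong negative correlations are flagged too (an assumption consistent with the two-tailed test).

```r
# Toy descriptors: two are strongly (anti)correlated with the reference
# descriptor, one is essentially unrelated to it
toy <- data.frame(
  reference = c(1, 2, 3, 4, 5, 6),
  positive  = c(1.1, 1.9, 3.2, 3.9, 5.1, 6.0),
  negative  = c(-1, -2.1, -2.9, -4.2, -5, -5.9),
  unrelated = c(1, -1, -1, 1, 1, -1)
)

threshold <- 0.735
corr <- cor(toy, method = "pearson")

# Binary matrix: 1 = significant correlation, 0 = insignificant
binary <- (abs(corr) > threshold) * 1
diag(binary) <- 0  # self-correlations are ignored

# Drop every descriptor that is significantly correlated with the reference
keep <- binary[, "reference"] == 0
reduced <- binary[keep, keep, drop = FALSE]

remaining <- colnames(reduced)  # descriptors kept
n_left <- sum(reduced)          # significant correlations left
```

Here the reference descriptor plays the role of an in-house or PubChem descriptor that is retained preferentially, and the descriptors correlated with it are discarded.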
Non-polar solvent descriptors
An equivalent methodology was employed to remove the significant cross-correlations (Pearson correlation coefficient >= 0.641) within the NPS molecular descriptors dataset.
NPS_corr_DP <- cor(NPS_descriptors_training[, 8:684], method = "pearson")
NPS_corr_names_DP <- findCorrelation(NPS_corr_DP, cutoff = 0.641, names = TRUE, exact = TRUE)
# Keep the Dragon and PaDEL descriptors that were not flagged by findCorrelation()
NPS_colnames_DP <- setdiff(colnames(NPS_descriptors_training[8:684]), NPS_corr_names_DP)
## Number of descriptors removed: 655
# Binary correlation matrix
NPS_colnames_basic_DP <- c(colnames(NPS_descriptors_training[2:7]), NPS_colnames_DP)
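NPS_corr_basic_DP <- as.data.frame(cor(NPS_descriptors_training[c(NPS_colnames_basic_DP)],
                                       method = "pearson"))
# abs() is applied to the correlations (not to the threshold); the cut-off
# is set to the NPS value of 0.641 stated above rather than the PS value
NPS_corr_basic_DP[abs(NPS_corr_basic_DP) <= 0.641] <- 0
NPS_corr_basic_DP[abs(NPS_corr_basic_DP) > 0.641] <- 1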
diag(NPS_corr_basic_DP) <- 0
# Removal of significantly correlated descriptors
NPS_corr_basic_DP <- NPS_corr_basic_DP[NPS_corr_basic_DP["Molecular.volume..A3._NPS"] == 0,
                                       NPS_corr_basic_DP["Molecular.volume..A3._NPS"] == 0]
## Number of descriptors removed: 4
## Significant correlations left: 0
A total of 34 polar solvent descriptors and 24 non-polar solvent descriptors remained after the above preprocessing steps.
## Polar solvent descriptors
## [1] "Molecular.volume..A3._PS" "XlogP3_PS"
## [3] "Hydrogen.Bond.Donor.Count_PS" "Hydrogen.Bond.Acceptor.Count_PS"
## [5] "AmphM_PS" "J_dragon_PS"
## [7] "MPC05_dragon_PS" "IDDE_dragon_PS"
## [9] "BIC0_dragon_PS" "CIC3_dragon_PS"
## [11] "GGI4_dragon_PS" "SPAM_dragon_PS"
## [13] "MEcc_dragon_PS" "P2u_dragon_PS"
## [15] "E1u_dragon_PS" "L1v_dragon_PS"
## [17] "L3s_dragon_PS" "De_dragon_PS"
## [19] "Ds_dragon_PS" "nOHs_dragon_PS"
## [21] "C.001_dragon_PS" "H.052_dragon_PS"
## [23] "ALOGP2_dragon_PS" "B03.O.O._dragon_PS"
## [25] "F01.C.C._dragon_PS" "VC.3_padel_PS"
## [27] "ASP.4_padel_PS" "VE1_D_padel_PS"
## [29] "PPSA.1_padel_PS" "DPSA.3_padel_PS"
## [31] "FNSA.2_padel_PS" "WNSA.1_padel_PS"
## [33] "MOMI.XY_padel_PS" "geomRadius_padel_PS"
## Non-polar solvent descriptors
## [1] "Molecular.volume..A3._NPS" "Hydrogen.Bond.Acceptor.Count_NPS"
## [3] "AmphM_NPS" "X0A_dragon_NPS"
## [5] "Yindex_dragon_NPS" "IC2_dragon_NPS"
## [7] "SIC3_dragon_NPS" "IC4_dragon_NPS"
## [9] "DISPv_dragon_NPS" "L1u_dragon_NPS"
## [11] "G2u_dragon_NPS" "G2m_dragon_NPS"
## [13] "E2e_dragon_NPS" "G3s_dragon_NPS"
## [15] "E2s_dragon_NPS" "Dm_dragon_NPS"
## [17] "nCconj_dragon_NPS" "C.003_dragon_NPS"
## [19] "H.052_dragon_NPS" "Depressant.80_dragon_NPS"
## [21] "PetitjeanNumber_padel_NPS" "JGI8_padel_NPS"
## [23] "JGI9_padel_NPS" "WNSA.1_padel_NPS"
The results were used to reduce the number of predictors in the descriptor datasets.
PS_colnames <- c("PS", colnames(PS_corr_basic_DP))
PS_descriptors <- PS_descriptors[c(PS_colnames)]
NPS_colnames <- c("NPS", colnames(NPS_corr_basic_DP))
NPS_descriptors <- NPS_descriptors[c(NPS_colnames)]